
Conversation

@pafi-code

Description

When using the OTel HTTP exporters, stdout gets polluted with unhandled error traces. I don't think this is the desired behavior. It happens when the endpoint is not available while exporting telemetry; this can be caused by different events, but in general it should be handled more gracefully.
I therefore added error handling with more specific and shorter error messages.

Type of change

I'm not sure whether this change counts as a fix or a feature, since the service kept running either way.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

  • Don't provide a collector at the configured endpoint (see the reproduction sketch below)
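
For reference, a minimal reproduction sketch (hedged: it assumes the standard OTLP/HTTP span exporter from opentelemetry-exporter-otlp-proto-http and that no collector is listening on the default endpoint):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Point the exporter at an endpoint where nothing is listening.
exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("demo-span"):
    pass

# Flushing on shutdown triggers the export attempt and, today, a full stack trace on stdout.
provider.shutdown()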

Does This PR Require a Contrib Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@pafi-code pafi-code requested a review from a team as a code owner August 1, 2025 14:21
@linux-foundation-easycla

linux-foundation-easycla bot commented Aug 1, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@pafi-code pafi-code force-pushed the feature/add_error_handling_for_otlp_exporters branch from d3d4031 to afb76f2 on August 1, 2025 14:45
Comment on lines 174 to 189
try:
    resp = self._export(serialized_data, deadline_sec - time())
except Exception as error:
Member

Would you mind opening an issue or pointing to an existing issue for the behavior you're encountering? We can discuss possible fixes there.

I think there is a bug in _export() on L157, where the code assumes the connection will succeed, so the retry loop here is never executed. However, the raised exception should ultimately be caught here:

except Exception: # pylint: disable=broad-exception-caught
self._logger.exception(
"Exception while exporting %s.", self._exporting
)

If you just want the logs to go away, you can set a filter on the opentelemetry.sdk._shared_internal logger or disable it.
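
A minimal sketch of that workaround (assuming the noisy records come from the opentelemetry.sdk._shared_internal logger named above; the exact logger name may differ between SDK versions):

import logging

# Raise the level so the full stack traces from failed exports are suppressed.
logging.getLogger("opentelemetry.sdk._shared_internal").setLevel(logging.CRITICAL)

# Or silence the logger entirely:
# logging.getLogger("opentelemetry.sdk._shared_internal").disabled = True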

Author

@pafi-code pafi-code Aug 3, 2025

Yeah, so the issue for me was that the logger wrote the entire error trace to stdout, which totally makes sense given the code you shared.
My goal was to still know when something fails, not just disable the logger, as I feel the OTLP exporters should handle this internally.

On top of that, a general comment: doing the retry via a try/except in _export() feels a bit odd to me.
Thanks for taking the time to look into this! :)

Author

Do you still think I should open an issue for this in general?

Member

@emdneto emdneto Oct 22, 2025

@aabmass, from what I understand, this is basically when the HTTP endpoint is not available and the urllib call is unable to complete the request, which outputs the full urllib stack trace with the connection-refused error.
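
For illustration, a minimal sketch of that failure mode at the HTTP-client level (using requests, which later comments in this thread reference; the endpoint is just the default OTLP/HTTP traces path with nothing listening on it):

import requests

try:
    # No collector is listening here, so the connection is refused.
    requests.post("http://localhost:4318/v1/traces", data=b"", timeout=1)
except requests.exceptions.ConnectionError as exc:
    # This is the error that currently surfaces as a full stack trace in the logs.
    print(f"export failed: {exc}")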

Member

I think an appropriate fix could be to put the call in export()'s retry logic under a try/except, like:

try:
  resp = self._export(serialized_data, remaining_time)
except Exception as exc:
  is_retryable = True
else:
  is_retryable = _is_retryable(resp)

and use that variable in the if at L183. It may need more adjustments; wdyt @aabmass @DylanRussell?

Member

Yeah, makes sense to me @pafi-code.

Author

So just to be clear: you want me to implement the changes so that the retry loop also covers the case where the request is not successful?

Member

@pafi-code, if you are open to doing that, that would be good; if not, please let me know so I can file a PR. I believe the correct approach is to fix the bug where a failed request never enters the retry loop.

Author

PR is updated. Two questions:

  1. Are you fine with me catching Exception in general, or should I only handle ConnectionError? I assume ConnectionError is preferable.
  2. Are you fine with the log message in case of a timeout or when the maximum number of retries is reached?

Member

@emdneto emdneto Oct 29, 2025

  1. @pafi-code let's catch requests.exceptions.RequestException, wdyt?
  2. My first thought on this fix was:
        try:
            resp = self._export(serialized_data, remaining_time)
            if resp.ok:
                return SpanExportResult.SUCCESS
        except requests.exceptions.RequestException as exc:
            reason = str(exc)
            retryable = True
            status_code = None
        else:
            retryable = _is_retryable(resp)
            status_code = resp.status_code
            reason = resp.reason

        # Compute jittered exponential backoff
        backoff = (2 ** retry_num) * random.uniform(0.8, 1.2)
        remaining_time = deadline - time()

        if (
            not retryable
            or retry_num + 1 == _MAX_RETRYS
            or backoff > remaining_time
            or self._shutdown
        ):
            if status_code is None:
                _logger.error("Failed to export span batch: %s", reason)
            else:
                _logger.error(
                    "Failed to export span batch code: %s, reason: %s",
                    status_code,
                    reason,
                )
            return SpanExportResult.FAILURE

        _logger.warning(
            "Transient error %s encountered while exporting span batch, retrying in %.2fs.",
            reason,
            backoff,
        )

        if is_shutdown_in_progress.wait(backoff):
            _logger.warning("Shutdown in progress, aborting retry.")
            break

    return SpanExportResult.FAILURE

@pafi-code
Author

Okay, so I know the pipeline is failing and I assume this PR should be rebased, but for me the question remains whether my proposed fix is actually how this should be handled. I would appreciate some more feedback here. Also, I've created an issue related to this PR: #4712

@pafi-code pafi-code force-pushed the feature/add_error_handling_for_otlp_exporters branch from afb76f2 to f581f22 on October 24, 2025 10:02
Member

@emdneto emdneto left a comment

@pafi-code can we have tests for this?
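
For illustration, a hedged sketch of what such a test could look like (it assumes the OTLP/HTTP span exporter keeps its requests session on a private _session attribute, and that with the fix a connection error is reported as SpanExportResult.FAILURE rather than an uncaught exception; both are assumptions, not verified against the current code):

from unittest import mock

import requests

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import SpanExportResult


def test_export_returns_failure_when_connection_is_refused():
    exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
    # Simulate "no collector listening" by making the underlying HTTP call raise.
    with mock.patch.object(
        exporter._session,
        "post",
        side_effect=requests.exceptions.ConnectionError("connection refused"),
    ):
        assert exporter.export([]) == SpanExportResult.FAILURE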
